@albertzaharovits
Contributor

@albertzaharovits albertzaharovits commented Aug 7, 2025

Every cluster state applied in the IndicesClusterStateService can append a new RefCountingListener to a chain of such listeners. If the chain grows too long, the unlucky thread that decrements the head listener's ref count to 0 ends up calling each listener in turn and, assuming every ref count in the chain drops to 0 as a result, traverses the whole chain on its own thread stack, possibly resulting in a StackOverflowError.
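
To illustrate the failure mode, here is a minimal sketch in plain Java. `SimpleListener` and `ChainDepthDemo` are hypothetical stand-ins written for this example, not the Elasticsearch `RefCountingListener` or `SubscribableListener` classes. Each link completes the next one from inside its own callback, so completing the head nests one call per link on the same thread stack.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a completable listener; not the Elasticsearch class. */
final class SimpleListener {
    private final List<Runnable> subscribers = new ArrayList<>();
    private boolean done;

    void addListener(Runnable r) {
        if (done) {
            r.run();
        } else {
            subscribers.add(r);
        }
    }

    void complete() {
        done = true;
        for (Runnable r : subscribers) {
            r.run();
        }
    }
}

final class ChainDepthDemo {
    static int currentDepth;
    static int maxDepth;

    /** Builds a chain of the given length, completes its head, and returns the deepest nesting seen. */
    static int maxNestingForChain(int length) {
        currentDepth = 0;
        maxDepth = 0;
        SimpleListener head = new SimpleListener();
        SimpleListener previous = head;
        for (int i = 0; i < length; i++) {
            SimpleListener current = new SimpleListener();
            previous.addListener(() -> {
                currentDepth++;
                maxDepth = Math.max(maxDepth, currentDepth);
                current.complete(); // nested call: the next link runs on the same stack
                currentDepth--;
            });
            previous = current;
        }
        head.complete();
        return maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(maxNestingForChain(100));
    }
}
```

In this sketch the nesting depth grows linearly with the chain length, which is why a sufficiently long chain can exhaust the stack.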

This fix chains at most 8 RefCountingListener instances; the next one is forked onto a generic thread when it comes to execute.
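
A rough sketch of the bounding idea, under stated assumptions: `Link`, `BoundedChainDemo` and `MAX_CHAINED` are hypothetical names, and a queue-backed `Executor` stands in for the generic thread pool so the example stays deterministic. It is not the code from this PR. After 8 inline links, the next link is handed to the executor, so it starts on a fresh stack instead of deepening the current one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Executor;

/** Hypothetical stand-in for a completable listener; not the Elasticsearch class. */
final class Link {
    private final List<Runnable> subscribers = new ArrayList<>();
    private boolean done;

    void addListener(Runnable r) {
        if (done) {
            r.run();
        } else {
            subscribers.add(r);
        }
    }

    void complete() {
        done = true;
        for (Runnable r : subscribers) {
            r.run();
        }
    }
}

final class BoundedChainDemo {
    static final int MAX_CHAINED = 8; // cap taken from the description above
    static int depth;
    static int maxDepth;

    /** Builds a chain where every (MAX_CHAINED + 1)-th link is forked; returns the deepest nesting seen. */
    static int maxNestingWithForking(int length) {
        depth = 0;
        maxDepth = 0;
        Queue<Runnable> pending = new ArrayDeque<>();
        Executor forkingExecutor = pending::add; // assumption: stand-in for a generic thread pool
        Link head = new Link();
        Link previous = head;
        for (int i = 1; i <= length; i++) {
            Link current = new Link();
            Runnable step = () -> {
                depth++;
                maxDepth = Math.max(maxDepth, depth);
                current.complete();
                depth--;
            };
            if (i % (MAX_CHAINED + 1) == 0) {
                // Fork: enqueue the step so the calling stack can unwind first.
                previous.addListener(() -> forkingExecutor.execute(step));
            } else {
                previous.addListener(step);
            }
            previous = current;
        }
        head.complete();
        // Drain forked work; each task starts from this shallow frame.
        while (!pending.isEmpty()) {
            pending.remove().run();
        }
        return maxDepth;
    }
}
```

In this sketch the nesting never exceeds `MAX_CHAINED + 1` (the forked link plus the inline links that follow it), regardless of how long the chain is.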

@albertzaharovits albertzaharovits self-assigned this Aug 7, 2025
@albertzaharovits albertzaharovits added >bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v9.2.0 v8.19.2 v9.1.2 labels Aug 7, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Aug 7, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @albertzaharovits, I've created a changelog YAML for you.

@albertzaharovits
Contributor Author

Honestly, I think I prefer that every chained listener be executed on a generic thread, for code simplicity's sake.

Contributor

@DaveCTurner DaveCTurner left a comment

I'd rather we didn't extend the chain in the (overwhelmingly common) case where the cluster state update doesn't close any more shards.

Also can you cover this in a test?

lastClusterStateShardsClosedListener = new SubscribableListener<>();
currentClusterStateShardsClosedListeners = new RefCountingListener(lastClusterStateShardsClosedListener);
try {
previousShardsClosedListener.addListener(currentClusterStateShardsClosedListeners.acquire());
Contributor

Hmm are you sure we should move all this listener stuff below doApplyClusterState()?

Contributor Author

I can't think of any impact to execution.

Contributor Author

But I've put it back at the original place.

@albertzaharovits
Contributor Author

> I'd rather we didn't extend the chain in the (overwhelmingly common) case where the cluster state update doesn't close any more shards.

Pushed 3a00599

@albertzaharovits
Contributor Author

@DaveCTurner can you take another look please?

I've changed the code to avoid linking listeners when the applied cluster state doesn't close any shards.
I've also added a test asserting that every runnable registered before the oldest incomplete shard-close listener is run, while the rest are not.
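
The "don't link when nothing closes" idea can be sketched as follows. This is an illustration only: it swaps in `java.util.concurrent.CompletableFuture` for the Elasticsearch listener classes, and `ShardCloseChain`, `applyUpdate` and `links` are hypothetical names. An update that closes no shards reuses the previous "shards closed" future instead of adding a link.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; CompletableFuture stands in for the Elasticsearch listener types. */
final class ShardCloseChain {
    private CompletableFuture<Void> lastShardsClosed = CompletableFuture.completedFuture(null);
    private int links; // number of times the chain was actually extended

    /**
     * Records one applied update. Returns a future that completes once the shards closed by
     * this update and by all earlier updates have finished closing.
     */
    CompletableFuture<Void> applyUpdate(CompletableFuture<?>... shardCloses) {
        if (shardCloses.length == 0) {
            // Common case: nothing was closed, so reuse the existing future and add no link.
            return lastShardsClosed;
        }
        links++;
        lastShardsClosed = CompletableFuture.allOf(lastShardsClosed, CompletableFuture.allOf(shardCloses));
        return lastShardsClosed;
    }

    int links() {
        return links;
    }
}
```

With this shape, the chain only grows in proportion to the number of updates that actually close shards, rather than the number of updates applied.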

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@fcofdez
Contributor

fcofdez commented Aug 26, 2025

@elasticmachine update branch

@fcofdez
Contributor

fcofdez commented Aug 26, 2025

@elasticmachine test this

@fcofdez
Contributor

fcofdez commented Aug 27, 2025

@elasticmachine update branch

@fcofdez fcofdez merged commit eb75ba3 into elastic:main Aug 27, 2025
33 checks passed

Labels

>bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Coordination Meta label for Distributed Coordination team v8.19.4 v9.1.4 v9.2.0

5 participants